A Fusion of Algorithms in Near Duplicate Document Detection

Authors

  • Jun Fan
  • Tiejun Huang
Abstract

With the rapid development of the World Wide Web, there is a huge number of fully or partially duplicated pages on the Internet. Returning these near-duplicate results to users greatly degrades the user experience. In the process of deploying digital libraries, the protection of intellectual property and the removal of duplicate content need to be considered. This paper fuses some "state of the art" algorithms to reach better performance. We first introduce the three major algorithms (shingling, I-Match, simhash) in duplicate document detection and their subsequent developments. We take sequences of words (shingles) as the features of the simhash algorithm. We then incorporate the random-lexicon-based multi-fingerprint generation method into the shingling-based simhash algorithm and name the result the shingling-based multi-fingerprint simhash algorithm. We ran some preliminary experiments on a synthetic dataset based on the "China-US Million Book Digital Library Project". The experimental results demonstrate the efficiency of these algorithms.
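To make the idea of using word shingles as simhash features concrete, here is a minimal sketch of a generic shingle-based simhash. It is not the authors' implementation and does not include their multi-fingerprint extension; the function names, the shingle size `k=3`, the 64-bit fingerprint width, and the use of MD5 as the per-shingle hash are all assumptions chosen for illustration.

```python
import hashlib


def shingles(text, k=3):
    """Return the list of k-word shingles of text (k=3 is an assumed default)."""
    words = text.lower().split()
    if not words:
        return []
    if len(words) < k:
        return [" ".join(words)]
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


def simhash(text, k=3, bits=64):
    """Compute a simhash fingerprint using word shingles as features.

    bits must be a multiple of 8 and at most 128, because the per-shingle
    hash is taken from an MD5 digest (an illustrative choice).
    """
    counts = [0] * bits
    for sh in shingles(text, k):
        digest = hashlib.md5(sh.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        # Each shingle votes +1 or -1 on every bit position.
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two documents would then be compared with `hamming_distance(simhash(doc_a), simhash(doc_b))`; a small distance suggests near-duplicate content. The cut-off for "small" is a tuning decision that the abstract does not specify.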

Related articles

Identification of Duplicate News Stories in Web Pages

Identifying near duplicate documents is a challenge often faced in the field of information discovery. Unfortunately many algorithms that find near duplicate pairs of plain text documents perform poorly when used on web pages, where metadata and other extraneous information make that process much more difficult. If the content of the page (e.g., the body of a news article) can be extracted from...

Near Duplicate Text Detection Using Frequency-Biased Signatures

As the use of electronic documents becomes more popular, people want to find documents that are completely or partially duplicated. In this paper, we propose a near duplicate text detection framework using signatures to save space and query time. We also propose a novel signature selection algorithm which uses collection frequency of q-grams. We compare our algorithm with Winnowing, which is one of ...

New Issues in Near-duplicate Detection

Near-duplicate detection is the task of identifying documents with almost identical content. The respective algorithms are based on fingerprinting; they have attracted considerable attention due to their practical significance for Web retrieval systems, plagiarism analysis, corporate storage maintenance, or social collaboration and interaction in the World Wide Web. Our paper presents both an i...

A Near-duplicate Detection Algorithm to Facilitate Document Clustering

Web mining faces huge problems due to duplicate and near-duplicate web pages. Detecting near duplicates is very difficult in a large collection of data like the "internet". The presence of these web pages plays an important role in performance degradation when integrating data from heterogeneous sources. These pages either increase the index storage space or increase the serving costs. Detecting t...

Adaption of String Matching Algorithms for Identification of Near-Duplicate Music Documents

The number of copyright registrations for music documents is increasing each year. Computer-based systems may help to detect near-duplicate music documents and plagiarism. Most existing systems for comparing symbolic music are based on string matching algorithms and represent music as sequences of notes. Nevertheless, adaptation to the musical context raises specific pr...

Near Duplicate Document Detection Using Document-Level Features and Supervised Learning

This paper addresses the problem of near-duplicate documents. It proposes a new method to detect near-duplicate documents in a large document collection. The method consists of three steps: feature selection, similarity measures, and a discriminant function. Feature selection performs pre-processing, calculates the weight of each term, and selects the heavily weighted terms as features ...

Publication date: 2011